An often overlooked but critical aspect of machine learning is model interpretability. While black-box models may be acceptable in some cases, most businesses want to understand how their models work, derive insights from them, and improve their business based on those insights.
A good way to interpret a model is through feature importance. This project is about interpreting ML techniques using feature importance and selection.
Feature importance gives a relative ranking of the model features. This is an important step for businesses that want to analyse which features matter most for their use case, and it may matter more to them than the exact methods used for modelling.
More than the scores themselves, it is the relative ranking of features that usually matters.
Note: Below, I have tried out various ways of calculating feature importance. Although feature importance and selection can work with any type of algorithm, I have restricted my experiments to a Random Forest with 100 trees, and I have used the Boston housing dataset (regression) as an example.
About the dataset:
The target of this dataset is the median price of Boston houses. The independent variables cover rooms per dwelling, crime rate, and similar statistics for each house. There are 13 such features, and we try to rank them by importance and select among them.
%run featimp.py
Spearman's correlation coefficient is the Pearson correlation computed on the ranks of the two variables: the covariance of the rank variables divided by the product of their standard deviations. This value indicates how positively or negatively correlated a feature is with the target variable. It is an easy first step for EDA and model interpretation.
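Written out, with $R(\cdot)$ denoting the rank transform, Spearman's coefficient for a feature $x$ and target $y$ is:

$$\rho_s(x, y) = \frac{\operatorname{cov}\big(R(x),\, R(y)\big)}{\sigma_{R(x)}\, \sigma_{R(y)}}$$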
How I Implemented this
To implement this, I have iterated over each column of training data and computed the Spearman correlation coefficient with respect to target variable.
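A minimal sketch of that loop is below (the helper name is hypothetical and assumes `scipy` is available; the actual implementation in featimp.py may differ):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def spearman_importances(X: pd.DataFrame, y) -> pd.Series:
    """Spearman correlation of each feature column with the target."""
    scores = {col: spearmanr(X[col], y).correlation for col in X.columns}
    # Sort by absolute value: the sign gives direction, the magnitude gives strength
    return pd.Series(scores).sort_values(key=np.abs, ascending=False)
```

Sorting by absolute value matters here, since a strong negative correlation is just as informative as a strong positive one.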
Below, I am calling my plot function for the boston dataset to get the Spearman correlation coefficient.
# get dataset
from sklearn.datasets import load_boston
import pandas as pd

boston = load_boston()
y = boston.target
X = pd.DataFrame(boston.data, columns=boston.feature_names)
plot_spearman(X,y)
mRMR (minimal-redundancy-maximal-relevance) is a feature selection technique which also takes into consideration the codependencies among the feature variables. So it takes care of both the 'relevance' to the target and the 'redundancy' among the features themselves (i.e., codependent features).
For this implementation I have used the Spearman correlation from above as the importance metric and then used the formula below to compute the mRMR value.
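For reference, the standard mRMR criterion for a candidate feature $x_k$ given the already-selected set $S$ is the following (in the classic formulation $I$ is mutual information; here it is replaced by the absolute Spearman correlation):

$$J(x_k) = I(x_k;\, y) \;-\; \frac{1}{|S|} \sum_{x_j \in S} I(x_k;\, x_j)$$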
How I Implemented this
S is the set of selected features. This set is my output after iterating over all the columns in the dataset.
Initially, S is an empty set.
Using the above formula, we calculate the J value for each column x_k and choose the feature with maximum value to add to set S.
The same process is repeated, and elements are added to S one by one as per the mRMR algorithm.
The final set S records the order in which features were selected, which is essentially a selection rank.
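The steps above can be sketched as follows (hypothetical helper name; absolute Spearman correlation is used for both the relevance and the redundancy terms, as described):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def mrmr_rank(X: pd.DataFrame, y) -> list:
    """Greedy mRMR ranking: relevance minus mean redundancy with selected features."""
    relevance = {c: abs(spearmanr(X[c], y).correlation) for c in X.columns}
    selected, remaining = [], list(X.columns)
    while remaining:
        def j_score(c):
            if not selected:
                # S is empty for the first pick, so there is no redundancy term
                return relevance[c]
            redundancy = np.mean([abs(spearmanr(X[c], X[s]).correlation)
                                  for s in selected])
            return relevance[c] - redundancy
        best = max(remaining, key=j_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The returned list orders features by when they were selected, mirroring the selection rank described above.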
mrmr(X,y)
The features above are listed in the order in which the mRMR algorithm selected them. Based on this set, the top-k features can be selected.
Here, I have tried using the default RF importance package to calculate the feature importance scores.
plot_rf(X,y)
Image source https://explained.ai/rf-importance/index.html
From the above blog post (https://explained.ai/rf-importance/index.html), let's explore and understand the issue with the default importances in the RF package.
The package computes the gini (impurity-based) importance, which is biased and does not behave well with high-cardinality variables. This is a major drawback of this method of computation. However, the values may be more consistent when the variables are normalized (which is not a usual practice for tree models).
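To illustrate the bias with a small synthetic toy example of my own (not from the blog post): a pure-noise continuous column still receives a non-trivial share of impurity importance, because its high cardinality offers many candidate split points:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "signal": rng.integers(0, 2, size=n).astype(float),  # binary, truly predictive
    "noise": rng.normal(size=n),                          # continuous, pure noise
})
y = 2.0 * X["signal"].to_numpy() + 0.1 * rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
gini_imp = pd.Series(rf.feature_importances_, index=X.columns)
print(gini_imp)  # the noise column still gets a non-zero share
```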
Drop-column is a simple method which calculates the evaluation metric with and without each feature to compute that feature's importance.
How I Implemented this
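A minimal sketch of the idea (hypothetical helper name; it assumes the model was created with `oob_score=True`, and the real featimp.py code may differ):

```python
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor

def dropcol_importances(model, X: pd.DataFrame, y) -> pd.Series:
    """Drop-column importance: score drop when a feature is removed and the
    model is retrained from scratch."""
    base = clone(model).fit(X, y)
    baseline = base.oob_score_  # requires oob_score=True on the model
    imp = {}
    for col in X.columns:
        m = clone(model).fit(X.drop(columns=[col]), y)
        imp[col] = baseline - m.oob_score_  # positive => the feature helped
    return pd.Series(imp).sort_values(ascending=False)
```

Note that this retrains the model once per feature, which is why it is expensive compared to permutation importance.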
plot_drop(X,y)
The concept here is that a column is shuffled randomly so as to destroy its relationship with the target variable. A large drop from the baseline (the original model with the column intact) indicates that the column is in fact important. If there is not much of a difference, the feature is not predictive. Since we don't retrain the model, this is more efficient than drop-column.
How I Implemented this
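A minimal sketch of the shuffle-and-rescore loop (hypothetical helper name; the baseline would ideally be computed on a validation set, and my featimp.py version may differ):

```python
import numpy as np
import pandas as pd

def permutation_importances(model, X: pd.DataFrame, y, seed=0) -> pd.Series:
    """Permutation importance: score drop after shuffling one column,
    with no retraining of the model."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    imp = {}
    for col in X.columns:
        X_perm = X.copy()
        X_perm[col] = rng.permutation(X_perm[col].to_numpy())  # break the link to y
        imp[col] = baseline - model.score(X_perm, y)
    return pd.Series(imp).sort_values(ascending=False)
```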
plot_permute(X,y)
Unlike with drop-column, in the case of permutation importance codependent features share the importance and don't zero each other out. However, the importance can sometimes be over-estimated when codependent features are present.
There are many good packages in Python for feature importance visualization, like:
Out of these, I found the shap package to have impressive visualizations, and below I have tried it on my dataset with the RF model.
import shap
# load JS visualization code to notebook
shap.initjs()
# train Random Forest model
X,y = shap.datasets.boston()
rf.fit(X,y)
# explain the model's predictions using SHAP
# (same syntax works for LightGBM, CatBoost, scikit-learn and spark models)
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X)
# visualize the first prediction's explanation (use matplotlib=True to avoid Javascript)
shap.force_plot(explainer.expected_value, shap_values[0,:], X.iloc[0,:])
# summarize the effects of all the features
shap.summary_plot(shap_values, X)
shap.summary_plot(shap_values, X, plot_type="bar")
Below, I have selected the top 7 features based on:
Using these features, I then compared the validation metric for all the different methods. I tried this with 2 models - RF and OLS. The plots are below. I observed that Drop column, Permutation importance and Spearman gave the same feature selections, hence giving the same graphs.
# Random Forest
plot_compare(rf)
# OLS
plot_compare(ols)
To automatically select features, I used the drop-column importance that I implemented earlier. I computed my initial baseline (OOB) score, then dropped the least important feature and retrained the model to get the validation score. Once the validation score falls below the baseline, the process ends. To visualize this, I have plotted the validation scores. I then chose the feature set with the maximum validation score.
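A sketch of that backward-elimination loop (hypothetical helper; the drop-column importances are recomputed via the OOB score at every step, and my real implementation in featimp.py may differ):

```python
import pandas as pd
from sklearn.base import clone

def auto_select(model, X: pd.DataFrame, y):
    """Backward elimination: repeatedly drop the least important feature
    (by drop-column importance) while the OOB score keeps up."""
    features = list(X.columns)
    best_score = clone(model).fit(X[features], y).oob_score_
    best_features = list(features)
    while len(features) > 1:
        base = clone(model).fit(X[features], y).oob_score_
        # Drop-column importance over the current feature set
        imps = {
            col: base - clone(model).fit(X[[c for c in features if c != col]], y).oob_score_
            for col in features
        }
        features.remove(min(imps, key=imps.get))  # drop the least important
        score = clone(model).fit(X[features], y).oob_score_
        if score < best_score:
            break  # stop once the score falls below the best seen so far
        best_score, best_features = score, list(features)
    return best_features, best_score
```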
plot_auto()
The top features selected so that the metric is maximized are:
auto(X,y)
By bootstrapping the data and retraining the model, we can get the variance in importances.
For implementing this I have used the drop column importance and visualized the standard deviation for each feature.
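A sketch of the bootstrap loop (hypothetical signature: `imp_fn(model, X, y)` stands for any importance function that returns a `pd.Series`, such as a drop-column implementation):

```python
import numpy as np
import pandas as pd

def importance_variability(imp_fn, model, X: pd.DataFrame, y, n_boot=20, seed=0):
    """Mean and standard deviation of importances over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    runs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        runs.append(imp_fn(model, X.iloc[idx].reset_index(drop=True), y[idx]))
    runs = pd.DataFrame(runs)  # one row per bootstrap, one column per feature
    return runs.mean(), runs.std()
```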
plot_var(X,y)
The following were the key takeaways from this project for me:
Feature importance is a great tool to interpret machine learning models and derive business insights.
Feature importances from high-bias or otherwise weak models cannot be trusted.
Importances must be computed from a validation set, not the training set (or the OOB score works in the case of RF).
Importances may vary from model to model and cannot be reused while training different models.
By selecting features and identifying their importance, we can explain or interpret ML models for business stakeholders. Visualizing them is a good way to communicate the results.